This is an article that I have been excited to write for quite some time! While the model and classification techniques are nothing new, Naive Bayes (and its Bayes Net generalizations) builds upon all of the intuitions that I have laid out in past posts on probability theory, classic frequentist statistical inference, and bayesian inference. I highly recommend taking some time to read them over if your background in statistics, probability, and bayesian inference is rusty (in particular, I will assume the reader is comfortable with bayesian inference).
With that said, let's get started!
I want to begin by introducing a simple toy classification problem that can help create a holistic picture of what we are trying to accomplish. Imagine that you are working as a data scientist for an energy company, and you are trying to predict whether a given customer will be a "big spender" on energy that month, based on their energy consumption after 10 days. Now, "big spender" is clearly a rather arbitrary concept, but let's say that you have spoken to management and they classify a big spender as anyone using over 2000 kWh in a given month. To make this a bit more clear, we may have historical data that looks something like:
| Month | Energy Usage (After 10 Days) | Energy Usage (End of Month) | Big Spender? |
|---|---|---|---|
| 1 | 1200 | 3000 | Y |
| 2 | 450 | 890 | N |
| 3 | 600 | 1745 | N |
| 2 | 800 | 2100 | Y |
| 1 | 100 | 290 | N |
| 3 | 724 | 2020 | Y |
| 1 | 1800 | 1950 | N |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
As a side note, there are many reasons that we may want to classify a consumer as a big spender early in the month, particularly if we want to offer them insights into why they are consuming more energy, tips as to how they can reduce energy consumption, or a paper report in the mail (proven to be more engaging, while also the most costly form of engagement) that may help them reduce energy usage and hence their current bill. Reasons aside, the question that arises is: after 10 days, how do you classify users as big spenders?
Of course there are many options! We could train a logistic regression model to find the pattern between our input, $X$ (energy usage after 10 days), and our output, $Y$ (big spender?), or we could fit a decision tree to the data, or a neural network; our choices are endless. As the data scientist, your job is to understand the context of the problem, edge cases, business requirements, etc., and select the most appropriate model. In this case, considering the subject of this post, we will choose to use a Naive Bayes model.
Note, I want to again make clear that this is of course a toy problem. Training any model to make robust predictions from a single feature is a tall order, working well only if that feature happens to explain a large share of the variance in the dependent variable$^1$ (for more on this see my post on dimensionality reduction, particularly the example pertaining to height vs. urefu, where one almost perfectly explains the other).
With that said, let's take a look at some of this data (as always, you can choose to show or hide the code that makes up this post by scrolling to the bottom and clicking the corresponding button). We can see our two groups (big spenders and low spenders) shown below:
import numpy as np
from scipy.stats import norm
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
# Generate synthetic historical usage (after 10 days) for each class
big_spender_mean = 2000
big_spender_std = 800
num_big_spenders = 10000
big_spender_usage_ten_days = np.random.normal(big_spender_mean, big_spender_std, num_big_spenders)
low_spender_mean = 200
low_spender_std = 100
num_low_spenders = 20000
low_spender_usage_ten_days = np.random.normal(low_spender_mean, low_spender_std, num_low_spenders)
trace1 = go.Histogram(
x=big_spender_usage_ten_days,
nbinsx=100,
name="Big Spender",
marker_color='blue',
)
trace2 = go.Histogram(
x=low_spender_usage_ten_days,
nbinsx=100,
name="Low Spender",
marker_color='crimson',
)
data = [trace1, trace2]
layout = go.Layout(
width=750,
height=450,
title="Big Spender vs. Non Big Spender Distribution",
xaxis=dict(title="Usage after 10 days (kWh)"),
yaxis=dict(title="Number of Customers"),
barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can calculate a few descriptive statistics from the above data and see that our low spenders have a mean of ~$200$ kWh of energy used after 10 days, while our big spenders have a much higher mean of ~$2000$ kWh and a much larger standard deviation (spread) of $800$ kWh. For the sake of this problem our data was generated via a gaussian process (via numpy.random.normal), but in reality we could expect to see more of a long tail distribution for our big spenders. In other words, a business insight we could glean is that big spenders will consist mainly of homes that have used one to two thousand kWh of energy after 10 days, plus a few rare cases of large companies and office facilities using twenty times more energy after 10 days. Data generated via a gaussian process does not allow for that type of spread, so our real-world process surely is not gaussian in nature.
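To make that long tail point concrete, here is a quick sketch comparing a gaussian sample (using our big spender parameters) to a hypothetical log-normal sample with the same median; the `sigma=0.8` value is an arbitrary illustrative choice, not something fit to real data:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Gaussian sample with the big spender parameters used in this post
normal_sample = rng.normal(2000, 800, 10000)

# Hypothetical long tail alternative: log-normal with the same median (2000 kWh);
# sigma=0.8 is an arbitrary illustrative choice
long_tail_sample = rng.lognormal(mean=np.log(2000), sigma=0.8, size=10000)

print(normal_sample.max())     # at most a few standard deviations above the mean
print(long_tail_sample.max())  # can be many multiples of the median
```

Notice that the log-normal sample can produce the occasional office facility using tens of thousands of kWh, while the gaussian sample essentially never will.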
Now that we have seen the data set that we are provided to work with, let's restate the question we are trying to answer:
In the next month, how can we classify customers as either big or low spenders based on this past dataset?
In other words, next month if we are given a customer who has used $700$ kWh in the first 10 days, do we predict that they are going to be a big spender?
Now, one of the main jobs of a data scientist, applied mathematician, or really any analytical problem solver, is to place a qualitative question within a mathematical framework. Here we can put this into a probabilistic context by stating that our goal is to find:
$$P(\text{Big Spender} \mid \text{Usage after 10 days})$$Where it is common to refer to our data, usage after 10 days, as $X$, and our classification that we are trying to predict, big spender, as $Y$:
$$P(Y \mid X)$$In English (but in a mathematical context), we can define our goal as:
Find the probability that a user is a big spender, given their first 10 days of usage.
How can we go about finding that? Well, let's actually think for a minute about how that quantity is defined based on probability theory. Via the product rule$^2$, more commonly referred to as the rule of conditional probability, we can see that:
$$P(XY) = P(X \mid Y) \cdot P(Y) = P(Y \mid X) \cdot P(X)$$And via some simple algebra we arrive at:
$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$What do each of these quantities represent, in english? We have:
$$P(X \mid Y) \longrightarrow \text{Probability of observing our data, X, given the consumer is a big spender}$$$$P(Y) \longrightarrow \text{Prior probability that the user is a big spender}$$$$P(X) \longrightarrow \text{Probability of observing the data we did (across all types of spenders)}$$In a very general sense, this is the framework of bayesian inference (and the equation above is often referred to as Bayes' rule). In what I find to be the most intuitive notation, we can write the equation as:
$$P(H \mid DX) = \frac{P(D \mid HX) \cdot P(H \mid X)}{P(D \mid X)}$$Where $H$ represents a particular hypothesis that we are interested in (someone is a big spender), $D$ represents our observed data, and $X$ represents all background information that our probabilities are conditional on. Each term is generally referred to as:
$$\overbrace{P(H \mid DX)}^{\text{posterior probability}} = \frac{\overbrace{P(D \mid HX)}^\text{likelihood} \cdot \overbrace{P(H \mid X)}^\text{prior probability}}{\underbrace{P(D \mid X)}_\text{probability of data}}$$Now for an incredibly enlightening take on probability theory as extended logic (where the above notation is taken from), I highly recommend checking out Probability Theory: The Logic of Science.
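To make the mechanics of Bayes' rule tangible before we apply it to our spenders, here is a tiny numeric sanity check; all of the numbers below are invented purely for illustration (a 1-in-3 prior roughly matching our synthetic data's class balance, and two made-up likelihood values):

```python
# Toy numbers, invented for illustration only
p_big = 1 / 3                 # P(H): prior probability of being a big spender
p_low = 1 - p_big             # P(not H)
likelihood_big = 0.0004       # P(D | H): likelihood of the data under "big spender"
likelihood_low = 0.0030       # P(D | not H): likelihood under "low spender"

# P(D): total probability of the data across both hypotheses
p_data = likelihood_big * p_big + likelihood_low * p_low

# Bayes' rule: posterior = likelihood * prior / probability of data
posterior_big = likelihood_big * p_big / p_data
print(posterior_big)  # → 0.0625
```

Even though the prior isn't tiny, the much larger likelihood under the low spender hypothesis pulls the posterior probability of "big spender" down to about 6%.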
With that said, why is this relationship useful? Well, as stated earlier, it can actually allow us to conduct a general inference. To see what I mean and build a bit of intuition around this process, I want you to think about what $P(X \mid Y)$ means for a moment. Do we have enough information to calculate it?
Keep that thought in mind and take a look at the distribution plot below. We have essentially fit two normal distributions to our historical data: one distribution was fit to the big spenders, the other to the non big spenders:
hist_data = [big_spender_usage_ten_days, low_spender_usage_ten_days ]
group_labels = ['Big Spender', 'Non Big Spender']
group_colors = ['blue', 'crimson']
fig_traces = ff.create_distplot(
hist_data,
group_labels,
colors=group_colors,
bin_size=5,
curve_type='normal'
)
layout = go.Layout(
autosize=False,
width=750,
height=450,
title="Big Spender vs. Non Big Spender Distribution",
xaxis=dict(title="Usage after 10 days (kWh)"),
yaxis=dict(title="Number of Customers"),
barmode='stack'
)
fig = go.Figure(data=fig_traces, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
What exactly do I mean by fitting a distribution to our data? Well, remember that the normal distribution is defined mathematically as:
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} exp(-\frac{(x-\mu)^2}{2\sigma^2})$$And it is entirely parameterized by the mean and variance (in the univariate case). So, if we calculate the mean and variance for both of our classes respectively (big spenders and non big spenders), we can then plot the corresponding normal distributions fit to those parameters. In other words, we are finding two normal distributions:
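As a quick sanity check, we can code that density formula directly and compare it to scipy's implementation (here evaluated with the low spender parameters, $\mu = 200$ and $\sigma = 100$):

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """The univariate gaussian density, written out term by term."""
    return 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Evaluate at 250 kWh under the low spender parameters
print(normal_pdf(250, 200, 100))   # hand-rolled formula
print(norm.pdf(250, 200, 100))     # scipy's implementation — same value
```

The two agree to floating point precision, confirming that "fitting" a gaussian really just means choosing $\mu$ and $\sigma^2$.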
$$\text{Big Spender} \longrightarrow f(x \mid \mu_{BS}, \sigma_{BS}^2) = \frac{1}{\sqrt{2 \pi \sigma_{BS}^2}} exp(-\frac{(x-\mu_{BS})^2}{2\sigma_{BS}^2})$$$$\text{Low Spender} \longrightarrow f(x \mid \mu_{LS}, \sigma_{LS}^2) = \frac{1}{\sqrt{2 \pi \sigma_{LS}^2}} exp(-\frac{(x-\mu_{LS})^2}{2\sigma_{LS}^2})$$Where our big spender distribution is parameterized via $\mu_{BS} = 2000$ and $\sigma_{BS} = 800$, and our low spender distribution is parameterized via $\mu_{LS} = 200$ and $\sigma_{LS} = 100$.
big_spender_mean = big_spender_usage_ten_days.mean()
big_spender_std = big_spender_usage_ten_days.std()
low_spender_mean = low_spender_usage_ten_days.mean()
low_spender_std = low_spender_usage_ten_days.std()
usage = np.arange(-300, 5000, 1)
big_spender_normal_dist = norm.pdf(usage, big_spender_mean, big_spender_std)
low_spender_normal_dist = norm.pdf(usage, low_spender_mean, low_spender_std)
trace2 = go.Scatter(
x=usage,
y=big_spender_normal_dist,
marker = dict(
color = 'blue',
),
fill='tozeroy',
name="Normal Distribution fit to Big Spender Data"
)
trace1 = go.Scatter(
x=usage,
y=low_spender_normal_dist,
marker = dict(
color = 'crimson',
),
fill='tozeroy',
name="Normal Distribution fit to Low Spender Data"
)
data = [trace1, trace2]
layout = go.Layout(
legend=dict(x=0.5, y=0.97),
width=750,
height=450,
title="Normal Distributions fit to Big and Non Big Spenders Historical Data",
xaxis=dict(title="Usage in first 10 days of month"),
yaxis=dict(title=r"$\text{Probability Density: } f(x \mid \mu, \sigma^2)$")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
Those distributions above are $P(X \mid Y)$! In other words, we have one of the key pieces to finding $P(Y \mid X)$. Now, how can that be useful? Well, say we are given a consumer that we want to classify; they have used $250$ kWh of energy in the first 10 days of the month. What is the likelihood, $P(X \mid Y)$, of the data point for each respective distribution? Let's take a look at the plot below:
trace2 = go.Scatter(
x=usage,
y=big_spender_normal_dist,
marker = dict(
color = 'blue',
),
fill='tozeroy',
name="Normal Distribution fit to Big Spender Data"
)
trace1 = go.Scatter(
x=usage,
y=low_spender_normal_dist,
marker = dict(
color = 'crimson',
),
fill='tozeroy',
name="Normal Distribution fit to Low Spender Data"
)
offset = 300
trace3 = go.Scatter(
x=[usage[250 + offset]],
y=[low_spender_normal_dist[250 + offset]],
marker = dict(
color = 'green',
size=10
),
mode="markers",
name=f"Likelihood of observed data point, given low spender: {round(low_spender_normal_dist[250 + offset], 5)}"
)
trace4 = go.Scatter(
x=[usage[250 + offset]],
y=[big_spender_normal_dist[250 + offset]],
marker = dict(
color = 'gold',
size=10
),
mode="markers",
name=f"Likelihood of observed data point, given big spender: {round(big_spender_normal_dist[250 + offset], 5)}"
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
legend=dict(
x=0.5, y=0.97,
font_size=10,
),
width=750,
height=450,
title="Normal Distributions fit to Big and Non Big Spenders Historical Data",
xaxis=dict(title="Usage in first 10 days of month"),
yaxis=dict(title=r"$\text{Probability Density: } f(x \mid \mu, \sigma^2)$")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can see that, given the big spender distribution, the likelihood of a consumer using $250$ kWh of energy in the first 10 days is $0.00005$. On the other hand, the likelihood of a non big spender using $250$ kWh in the first 10 days is $0.0035$. In other words:
$$P(\text{250 kWh in first 10 days} \mid \text{Low spender}) > P(\text{250 kWh in first 10 days} \mid \text{High spender})$$$$P(X \mid Y_{LS}) > P(X \mid Y_{BS})$$At this point the gears should be turning and you should start to see potential ways to create an algorithmic decision procedure based on these likelihoods. A simple procedure would be to just predict that the consumer belongs to the class corresponding to a higher likelihood.
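A sketch of that simple likelihood-comparison rule might look like the following (using the class parameters from our synthetic data, and ignoring the priors for now):

```python
from scipy.stats import norm

def classify(usage_10_days):
    """Predict the class whose fitted gaussian assigns the higher likelihood.

    Parameters are those assumed throughout this post:
    big spenders ~ N(2000, 800^2), low spenders ~ N(200, 100^2).
    """
    likelihood_big = norm.pdf(usage_10_days, loc=2000, scale=800)
    likelihood_low = norm.pdf(usage_10_days, loc=200, scale=100)
    return "Big Spender" if likelihood_big > likelihood_low else "Low Spender"

print(classify(250))   # → Low Spender (the consumer from our example above)
print(classify(1500))  # → Big Spender
```

Note that this rule ignores the prior class frequencies entirely; with twice as many low spenders as big spenders in our data, a full bayesian treatment would weight the likelihoods by those priors.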
However, let's slow down for a moment. Clearly, even in this simple example there is a good deal of overlap between the two distributions. In the real world, there will be even more overlap and hence more ambiguity that the model will most likely struggle to handle. One solution, as mentioned earlier, is to include more features. You can imagine that the time of year will surely affect this prediction. For instance, if we are in the month of June, the first 10 days are likely to consume less energy than the last 10 days (because as the month goes on, the temperature rises and air conditioning is more likely to be utilized). Similarly, a weather forecast could be included, and a forecasted heat wave or cold front would clearly impact our prediction. So clearly adding more features would be beneficial; the question is, how exactly can we achieve this?
Fortunately, extending our idea of likelihood maximization is not difficult at all! We simply need to make use of the multivariate gaussian, defined as:
$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big)$$Where now $x$ is a vector containing each feature, $\mu$ is a vector holding the mean values of each feature, and $\Sigma$ is the covariance matrix (an $M \times M$ matrix, where $M$ is the number of features present). If this all seems a little abstract, fear not, we are going to walk through an example to clear up any ambiguity. I am going to add an additional feature: the difference between the forecasted temperature on day 15 of the month and the actual temperature on day 5. This feature will serve as a proxy/heuristic for whether the remaining half of the month will be hotter or colder than the first. As a simple example, imagine the temperature on day 5 of the month was 70 degrees F, while the forecasted temperature for day 15 was 85 degrees F. This implies that the temperature in the second half of the month will be greater than in the first, and hence energy usage may increase by more than a factor of 3 before the month has ended (a factor of 3 would imply that each 10 day interval, i.e. the first, middle, and last 10 days of the month, all use the same amount of energy).
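Before we go further, we can sanity check the multivariate formula the same way we did the univariate one, by coding it term by term and comparing against scipy; the $\mu$, $\Sigma$, and evaluation point below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                     # feature means (illustrative values)
Sigma = np.array([[1.0, -0.5], [-0.5, 1.5]])  # covariance matrix (illustrative values)
x = np.array([0.5, 0.5])                      # point at which to evaluate the density

D = len(mu)  # number of features
diff = x - mu

# The multivariate gaussian density, written out term by term
by_hand = (1 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
           * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

print(by_hand)
print(multivariate_normal(mu, Sigma).pdf(x))  # scipy agrees
```

The off-diagonal $-0.5$ terms in $\Sigma$ encode a negative correlation between the two features, something the product of two univariate gaussians could never capture.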
Let's build some visual intuition for what a bivariate gaussian looks like (the parameters below are chosen arbitrarily for illustration):
from scipy.stats import multivariate_normal
# Evaluate a bivariate gaussian density over a 2-d grid
N = 60
X = np.linspace(-3, 3, N)
Y = np.linspace(-3, 4, N)
X, Y = np.meshgrid(X, Y)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
mu = np.array([0., 1.])
Sigma = np.array([[1., -0.5], [-0.5, 1.5]])
F = multivariate_normal(mu, Sigma)
Z = F.pdf(pos)
fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z)])
fig.update_layout(
width=750,
height=450,
title="Bivariate Gaussian Density"
)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))